ABSTRACT


Spaced repetition is among the most studied learning
strategies in the cognitive science literature. It consists
of temporally distributing exposure to a piece of information
so as to improve long-term memorization. Providing students
with an adaptive and personalized distributed practice
schedule would benefit them more than a generic scheduler
would. However,
the applicability of such adaptive schedulers seems to be 
limited to pure memorization, e.g. flashcards or foreign lan- 
guage learning. In this article, we first frame the research 
problem of optimizing an adaptive and personalized spaced 
repetition scheduler when memorization concerns the
application of multiple underlying skills. To this end, we choose
to rely on a student model for inferring knowledge state and 
memory dynamics on any skill or combination of skills. We 
argue that no knowledge tracing model takes both memory 
decay and multiple skill tagging into account for predicting 
student performance. As a consequence, we propose a new 
student learning and forgetting model suited to our research 
problem: DAS3H builds on the additive factor models and 
includes a representation of the temporal distribution of past 
practice on the skills involved by an item. In particular, 
DAS3H allows the learning and forgetting curves to differ 
from one skill to another. Finally, we provide empirical evi- 
dence on three real-world educational datasets that DAS3H 
outperforms other state-of-the-art EDM models. These results
suggest that incorporating both item-skill relationships and
the forgetting effect improves over student models that
consider only one or the other.

1. INTRODUCTION


Learners have to manage their studying time wisely: they
constantly have to make a trade-off between acquiring new
knowledge and reviewing previously encountered learning
material. Considering that learning often involves building
on old knowledge (e.g. in mathematics) and that efforts
undertaken in studying new concepts may be significant,
this issue should not be taken lightly. However, few
school incentive structures encourage long-term retention,
so students often favor short-term memorization and
poor learning practices [37, 22].


Fortunately, there are simple learning strategies that help 
students efficiently manage their learning time and improve 
long-term memory retention at a small cost. Among them, 
the spacing and the testing effects have been widely
replicated [36, 7] since their discovery in the 19th century.
Both of them are recommended by cognitive scientists [24, 46]
in order to improve public instruction. The spacing effect states
that temporally distributing learning episodes is more bene- 
ficial to long-term memory than learning in a single massed 
study session. The testing effect [35, 5], also known as
retrieval practice, consists in self-testing after being
exposed to new knowledge instead of simply reading the
lesson again. This test can take multiple forms: free recall,
cued recall, multiple-choice questions, application exercises, 
and so on. A recent meta-analysis on the testing effect [1] 
found a strong and positive overall effect size of g = 0.61 for 
testing compared to non-testing reviewing strategies. An- 
other meta-analysis [23] investigated whether learning with 
retrieval practice could transfer to different contexts and 
found a medium yet positive overall transfer effect size of 
d = 0.40. Combining both strategies is called spaced re- 
trieval practice: temporally distributing tests after a first 
exposure to knowledge. 


Recent research effort has been put into developing adaptive
and personalized spacing schedulers for improving long-term
retention of flashcards [40, 33, 18]. Compared to
non-adaptive schedulers, they show substantial improvement in
learners’ retention at immediate and delayed tests [19].
However, and to the best of our knowledge, there is no work 
on extending these algorithms when knowledge to be remem- 
bered concerns the application of underlying skills. Yet, the 
spacing effect is not limited to vocabulary learning or even 
pure memorization: it has been successfully applied to the 
acquisition and generalization of abstract science concepts 
[44] and to the practice of mathematical skills in a real edu- 
cational setting [3]. Conversely, most models encountered in 
knowledge tracing involve multiple skills, but do not model 
forgetting. The goal of the present article is to start
filling this gap by developing a student learning and forgetting
model for inferring skill knowledge states and memory
dynamics. This model will serve as a basis for the future
development of adaptive and personalized skill practice
scheduling algorithms for improving learners’ long-term memory.


Our contribution is two-fold. We first frame our research 
problem for extending the flashcards-based adaptive spacing 
framework to contexts where memorization concerns the ap- 
plication of underlying skills. In that perspective, students 
learn and reinforce skill mastery by practicing items involv- 
ing that skill. We argue that this extension requires new 
student models to model learning and forgetting processes 
when multiple skills are involved by a single item. Thus, 
we also propose a new student model, coined DAS3H, that 
extends DASH [18, 22] and accounts for memory decay and 
the benefits of practice when an item can involve multiple 
knowledge components. Finally, we provide empirical evi- 
dence on three publicly available datasets showing that our 
model outperforms other state-of-the-art student models. 


2. RELATED WORK 


In this section, we first detail related work on adaptive spac- 
ing algorithms before turning to student modeling. 


In what follows, we will index students by $s \in [\![1, S]\!]$,
items (or questions, exercises) by $j \in [\![1, J]\!]$, skills or
knowledge components (KCs) by $k \in [\![1, K]\!]$, and timestamps
by $t \in \mathbb{R}^+$ (in days). For convenience, we assume that
timestamps are encoded as the number of days elapsed since the
first interaction with the system. This is sufficient because
we only need to know the duration between two interactions.
$Y_{s,j,t} \in \{0, 1\}$ gives the binary correctness of student
$s$ answering item $j$ at time $t$. $\sigma$ is the logistic
function: $\forall x \in \mathbb{R}, \sigma(x) = 1/(1 + \exp(-x))$.
$KC(\cdot)$ takes as input an item index $j$ and outputs the set
of skill indices involved by item $j$.


Let us quickly detail what we mean by skill. In this article,
we treat skills and knowledge components as equivalent.
Knowledge components are atomistic components of knowledge by
which items are tagged. An item may have one or more KCs,
and this information is synthesized by a so-called binary
q-matrix [41]: $\forall (j, k) \in [\![1, J]\!] \times [\![1, K]\!],
q_{jk} = \mathbb{1}_{k \in KC(j)}$. We assume that the probability
of correctly answering an item $j$ that involves skill $k$
depends on the student’s mastery of skill $k$; conversely, we
measure skill mastery by the ability of student $s$ to remember
skill $k$ and apply it to solve any (possibly unseen) item that
involves skill $k$.
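
As an illustration of the q-matrix definition above, the following minimal Python sketch builds the binary matrix from an item-to-skill mapping. The mapping `kc`, the sizes, and the function name `build_q_matrix` are hypothetical and chosen for illustration only; indices are 0-based here, whereas the text uses 1-based indices.

```python
def build_q_matrix(kc, n_items, n_skills):
    """Return q with q[j][k] = 1 if skill k is in KC(j), else 0."""
    q = [[0] * n_skills for _ in range(n_items)]
    for j, skills in kc.items():
        for k in skills:
            q[j][k] = 1
    return q


# Hypothetical tagging: 3 items and 2 skills (0-based indices).
# Item 2 involves both skills, which a one-skill-per-item setting cannot express.
kc = {0: {0}, 1: {1}, 2: {0, 1}}
q_matrix = build_q_matrix(kc, n_items=3, n_skills=2)
```

Each row of `q_matrix` describes one item, so a row with several ones corresponds to an item tagged with multiple skills.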


2.1 Adaptive spacing algorithms 

Adaptive spacing schedulers leverage the spaced retrieval 
learning strategy to maximize learning and retention of a 
set of items. They proceed by sequentially deciding which 
item to ask the user at any time based on the user’s past 
study history. Items to memorize are often represented by 
flashcards, i.e. cards on which one side contains the question 
(e.g. When did the Great Fire of London occur? or What 
is the correct translation of “manger” in English?) and the 
other side contains the answer. 


Early adaptive spacing systems made use of physical
flashcards [17], but the advent of computer-assisted instruction
made possible the development of electronic flashcards [51],
thus allowing more complex and personalized strategies for
optimal reviewing. Nowadays, several adaptive spacing
software tools are available to the general public, e.g. Anki,
SuperMemo, and Mnemosyne.


Originally, adaptive reviewing systems made decisions based
on heuristics and handmade rules [17, 30, 51]. Though
they may be effective in practice [20], these early systems
lack performance guarantees [40]. Recent research has started
to tackle this issue: for instance, Reddy et al. propose a
mathematical formalization of the Leitner system and a heuristic
approximation used for optimizing the review schedule [32].


A common approach for designing adaptive spaced repetition
schedulers consists in modeling human memory statistically
and recommending the item whose memory strength is
closest to a fixed value $\theta$ [22, 18, 20]. Khajah, Lindsey
and Mozer found that this simple heuristic is only slightly
less efficient than exhaustive policy search in many
situations [14]. It has the additional advantage of fitting into
the notion of “desirable difficulties” coined by Bjork [4].
Pavlik and Anderson
[26] use an extended version of ACT-R declarative memory 
model to build an adaptive scheduler for optimizing item 
practice (in their case, Japanese-English word pairs) given 
a limited amount of time. ACT-R is originally capable of 
predicting item correctness and speed of recall by taking re- 
cency and frequency of practice into account. Pavlik and 
Anderson extend ACT-R to capture the spacing effect as 
well as item, learner, and item-learner interaction variabil- 
ity. The adaptive scheduler uses the model estimation of 
memory strength gain at retention test per unit of time to 
decide when to present each pair of words to a learner. 
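
The threshold heuristic mentioned at the start of this paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the predicted recall probability stands in for memory strength, and the item names, the probabilities, and the function name `pick_item` are hypothetical rather than taken from any of the cited systems.

```python
def pick_item(recall_probs, threshold):
    """Return the item whose predicted recall probability is closest to threshold."""
    return min(recall_probs, key=lambda item: abs(recall_probs[item] - threshold))


# Hypothetical predictions produced by some memory model for three flashcards.
recall_probs = {"manger": 0.95, "boire": 0.42, "dormir": 0.71}
next_item = pick_item(recall_probs, threshold=0.4)
```

In this toy case the scheduler would review "boire" next, since its predicted recall is the closest to the chosen threshold.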


Other approaches do not rely on any memory model: Reddy, 
Levine and Dragan formalize this problem as a POMDP 
(Partially Observable Markov Decision Process) and approx- 
imately solve it within a deep reinforcement learning archi- 
tecture [33]. However, they only test their algorithm on 
simulated students. A more recent work [40] formalizes the 
spaced repetition problem with marked temporal point pro- 
cesses and solves a stochastic optimal control problem to 
optimally schedule spaced review of items. Mettler, Massey 
and Kellman [19] compare an adaptive spacing scheduler 
(ARTS) to two fixed spacing conditions. ARTS leverages 
students’ response times, performance, and number of tri- 
als to dynamically compute a priority score for adaptively 
scheduling item practice. Response time is used as an indi- 
cator of retrieval difficulty and thus, learning strength. 


Our work can more generally relate to the problem of au- 
tomatic optimization of teaching sequences. Rafferty et al. 
formulate this problem as a POMDP planning problem [31]. 
Whitehill and Movellan build on this work but use a hier- 
archical control architecture for selecting optimal teaching 
actions [48]. Lan et al. use contextual bandits to select 
the best next learning action by using an estimation of the 
student’s knowledge profile [16]. Many intelligent tutoring
systems (ITS) use mastery learning within the Knowledge
Tracing [8] framework: making students work on a given
skill until the system infers that they have mastered it.


We can see that the traditional adaptive spacing framework 
already uses a spaced retrieval practice strategy to opti- 
mize the student’s learning time. However, it is not directly 
adapted to learning and memorization of skills. In this lat- 
ter case, specific items are the only way to practice one or 
multiple skills, because we do not have to memorize the con- 
tent directly. Students who master a skill should be able to 
generalize to unseen items that also involve that skill. In 
Section 3, we propose an extension of this original frame- 
work in order to apply adaptive spacing algorithms to the 
memorization of skills. 


2.2 Modeling student learning and forgetting

The scientific literature on student modeling is
particularly rich. In what follows, we focus on the subprob- 
lem of modeling student learning and forgetting based on 
student performance data. 


As Vie and Kashima recall [43], two main approaches have 
been used for modeling student learning and predicting stu- 
dent performance: Knowledge Tracing and Factor Analysis. 


Knowledge Tracing [8] models the evolution of a student’s 
knowledge state over time so as to predict a sequence of 
answers. The original and still most widespread model of 
Knowledge Tracing is Bayesian Knowledge Tracing (BKT). 
It is based on a Hidden Markov Model where the knowledge 
state of the student is the latent variable and skill mastery is 
assumed binary. Since its creation, it has been extended to 
overcome its limits and account for instance for individual 
differences between students [52]. More recently, Piech et al. 
replaced the original Hidden Markov Model framework with 
a Recurrent Neural Network and proposed a new Knowledge 
Tracing model called Deep Knowledge Tracing (DKT) [29]. 
Despite a mild controversy concerning the relevance of using 
deep learning in an educational setting [50], recent works 
continue to develop this line of research [53, 21]. 


Contrary to Knowledge Tracing, Factor Analysis does not 
originally take the order of the observations into account. 
IRT (Item Response Theory) [42] is the canonical model for 
Factor Analysis. In its simplest form, IRT reads: 


$$P(Y_{s,j} = 1) = \sigma(\alpha_s - \delta_j)$$


with $\alpha_s$ ability of student $s$ and $\delta_j$ difficulty of item $j$. One
of the main assumptions of IRT is that the student ability is 
static and cannot change over time or with practice. Despite 
its apparent simplicity, IRT has proven to be a robust and 
reliable EDM model, even outperforming much more com- 
plex architectures such as DKT [49]. IRT can be extended 
to represent user and item biases with vectors instead of 
scalars. This model is called MIRT, for Multidimensional 
Item Response Theory: 


$$P(Y_{s,j} = 1) = \sigma(\langle \alpha_s, \delta_j \rangle + d_j).$$


In this case, $\alpha_s$ and $\delta_j$ are $d$-dimensional vectors, and
$d_j$ is a scalar that captures the easiness of item $j$.
$\langle \cdot, \cdot \rangle$ is the usual dot product between two vectors.
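
To make the two formulas concrete, here is a minimal Python sketch of the IRT and MIRT predictions. The function names and all numeric values are hypothetical and serve only as an illustration; no parameter estimation is shown.

```python
import math


def sigmoid(x):
    """Logistic function sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def irt(alpha_s, delta_j):
    """IRT: scalar student ability minus scalar item difficulty."""
    return sigmoid(alpha_s - delta_j)


def mirt(alpha_s, delta_j, d_j):
    """MIRT: dot product of two d-dimensional vectors plus item easiness d_j."""
    dot = sum(a * b for a, b in zip(alpha_s, delta_j))
    return sigmoid(dot + d_j)


# Hypothetical parameter values.
p_irt = irt(alpha_s=0.5, delta_j=0.5)              # ability equals difficulty
p_mirt = mirt([1.0, -0.5], [0.2, 0.4], d_j=0.0)    # dot product is zero here
```

When ability equals difficulty in IRT, or when the dot product and easiness cancel out in MIRT, the predicted probability of a correct answer is one half.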


More recent works incorporated temporality in Factor Analysis
models, by taking practice history into account. For
instance, AFM (Additive Factor Model) [6] models:


$$P(Y_{s,j} = 1) = \sigma\left(\sum_{k \in KC(j)} \beta_k + \gamma_k a_{s,k}\right)$$


with $\beta_k$ easiness of skill $k$ and $a_{s,k}$ number of attempts of
student $s$ on KC $k$ prior to this attempt. Performance Factor
Analysis [27] (PFA) builds on AFM and uses past outcomes
of practice instead of simple encounter counts:


$$P(Y_{s,j} = 1) = \sigma\left(\sum_{k \in KC(j)} \beta_k + \gamma_k c_{s,k} + \rho_k f_{s,k}\right)$$


with $c_{s,k}$ number of correct answers of student $s$ on KC $k$
prior to this attempt and $f_{s,k}$ number of wrong answers of
student $s$ on KC $k$ prior to this attempt.
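
The difference between AFM and PFA can be illustrated with a short Python sketch. All parameter values and counts below are hypothetical, and the function names are chosen for illustration; the sketch only evaluates the two formulas and does not fit anything.

```python
import math


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def afm(skills, beta, gamma, attempts):
    """AFM: skill easiness plus a per-skill weight on prior attempt counts."""
    return sigmoid(sum(beta[k] + gamma[k] * attempts[k] for k in skills))


def pfa(skills, beta, gamma, rho, correct, wrong):
    """PFA: prior correct and wrong answers are weighted separately."""
    return sigmoid(sum(beta[k] + gamma[k] * correct[k] + rho[k] * wrong[k]
                       for k in skills))


# Hypothetical parameters and history for an item tagged with skills 0 and 1.
skills = [0, 1]
beta = {0: -0.5, 1: 0.2}
gamma = {0: 0.1, 1: 0.3}
rho = {0: -0.2, 1: -0.1}
correct = {0: 3, 1: 1}
wrong = {0: 1, 1: 0}
attempts = {k: correct[k] + wrong[k] for k in skills}

p_afm = afm(skills, beta, gamma, attempts)
p_pfa = pfa(skills, beta, gamma, rho, correct, wrong)
```

With these made-up values, PFA predicts a lower probability than AFM because the wrong answer on skill 0 is penalized instead of being counted as plain practice.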


Ekanadham and Karklin take a step further to account for 
temporality in the IRT model and extend the two-parameter 
ogive IRT model (2PO model) by modeling the evolution of 
the student ability as a Wiener process [10]. However, they 
do not explicitly account for student memory decay. 


The recent framework of KTM (Knowledge Tracing Ma- 
chines) [43] encompasses several EDM models, including 
IRT, MIRT, AFM, and PFA. KTMs are based on factor- 
ization machines and model the probability of correctness 
as follows: 


$$P(Y_t = 1) = \sigma\left(\mu + \sum_{i=1}^{N} w_i x_{t,i} + \sum_{1 \le i < \ell \le N} x_{t,i} x_{t,\ell} \langle v_i, v_\ell \rangle\right)$$


where $\mu$ is a global bias, $N$ is the number of abstract
features, be it item parameters, temporal features, etc., $x_t$ is a
sample gathering all features collected at time $t$ (which
student answers which item, and information regarding prior
attempts), $w_i$ is the bias of feature $i$ and $v_i \in \mathbb{R}^d$
its embedding. The features involved in a sample $x_t$ are typically
few, so this probability can be computed efficiently.
In KTM, one can recover several existing EDM
models by selecting the appropriate features to consider in 
the modeling. For instance, if we consider user and item 
features only, we recover IRT. If we consider the skill fea- 
tures in the q-matrix, and the counter of prior successes and 
failures at skill level, we recover PFA. 
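
As a rough illustration of how a KTM prediction is computed from a sparse sample, the sketch below evaluates the formula above in plain Python. The feature layout (two users followed by two items), the parameter values, and the function name `ktm_probability` are assumptions made for this example; this is not the libFM implementation.

```python
import math


def ktm_probability(x, mu, w, v):
    """Evaluate the KTM formula for one sparse sample.

    x maps active feature indices to values, w holds one bias per feature,
    and v holds one d-dimensional embedding (a list) per feature.
    """
    active = sorted(x)
    linear = sum(w[i] * x[i] for i in active)
    pairwise = 0.0
    for pos, i in enumerate(active):
        for l in active[pos + 1:]:
            dot = sum(a * b for a, b in zip(v[i], v[l]))
            pairwise += x[i] * x[l] * dot
    return 1.0 / (1.0 + math.exp(-(mu + linear + pairwise)))


# Hypothetical layout: features 0-1 are users, features 2-3 are items.
w = [0.3, -0.1, 0.2, -0.4]      # user biases act as abilities, item biases as easiness
v = [[] for _ in range(4)]      # d = 0: empty embeddings, no pairwise term
x = {0: 1.0, 3: 1.0}            # user 0 answers the item stored at feature 3
p = ktm_probability(x, mu=0.0, w=w, v=v)
```

With only user and item features and `d = 0`, the prediction reduces to a logistic function of a user bias plus an item bias, which is the IRT case described above.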


One of the very first works on human memory modeling 
dates back to 1885 and stems from Ebbinghaus [9]. He 
models the probability of recall of an item as an exponen- 
tial function of memory strength and delay since last review. 
More recently, Settles and Meeder propose an extension of 
the original exponential forgetting curve model, the half-life 
regression [38]. They estimate item memory strength as an 
exponential function of a set of features that contain infor- 
mation on the past practice history and on the item to re- 
member (lexeme tag features, in their case). More sophisti- 
cated memory models have also been proposed: for instance 
ACT-R (Adaptive Character of Thought—Rational) [2] and 
MCM (Multiscale Context Model) [25]. 


Walsh et al. [45] offer a comparison of three computational
memory models: the ACT-R declarative memory model [26], the
Predictive Performance Equation (PPE) and a generalization
of Search of Associative Memory (SAM). These models
differ in how they predict the impact of spacing on
subsequent relearning, after a long retention interval. PPE is
the only one to predict that spacing may accelerate
subsequent relearning (“spacing accelerated relearning”), an
effect that was empirically highlighted by their experiment.
PPE also showed a superior fit to experimental data,
compared to SAM and ACT-R.


DASH [22, 18] bridges the gap between factor analysis and 
memory models. DASH stands for Difficulty, Ability, and 
Student History. Its formulation reads: 


$$P(Y_{s,j,t} = 1) = \sigma(\alpha_s - \delta_j + h_\theta(t_{s,j,1:l}, y_{s,j,1:l-1}))$$


with $h_\theta$ a function parameterized by $\theta$ (learned by DASH)
that summarizes the effect of the $l - 1$ previous attempts
where student $s$ reviewed item $j$ ($t_{s,j,1:l-1}$) and the binary
outcomes of these attempts ($y_{s,j,1:l-1}$). Their main choice
for $h_\theta$ is:


$$h_\theta(t_{s,j,1:l}, y_{s,j,1:l-1}) = \sum_{w=0}^{W-1} \theta_{2w+1} \log(1 + c_{s,j,w}) - \theta_{2w+2} \log(1 + a_{s,j,w})$$


with $w$ indexing a set of expanding time windows, $c_{s,j,w}$ is
the number of correct outcomes of student $s$ on item $j$ in
time window $w$ out of a total of $a_{s,j,w}$ attempts. The time
windows $w$ are not disjoint and span increasing time
intervals. They allow DASH to account for both learning and
forgetting processes. The use of log counts induces
diminishing returns of practice inside a given time window and
difference of log counts formalizes a power law of practice.
The time module $h_\theta$ is inspired by ACT-R [2] and MCM [25]
memory models.
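
The time-window counts and the resulting value of the time module can be sketched as follows. This is a minimal illustration, assuming nested windows measured backwards from the current time and window lengths passed in as a parameter; the practice history, the weights, and the function names are hypothetical.

```python
import math


def window_counts(past, now, windows):
    """Count wins and attempts falling in each nested time window.

    past is a list of (timestamp_in_days, correct) pairs for earlier attempts;
    an attempt belongs to window w if it happened at most windows[w] days ago.
    """
    wins = [0] * len(windows)
    attempts = [0] * len(windows)
    for t, correct in past:
        for w, length in enumerate(windows):
            if now - t <= length:
                attempts[w] += 1
                wins[w] += int(correct)
    return wins, attempts


def h_theta(wins, attempts, theta):
    """DASH time module; theta[2w] and theta[2w + 1] are the 0-based
    counterparts of theta_{2w+1} and theta_{2w+2} in the text."""
    return sum(theta[2 * w] * math.log(1 + wins[w])
               - theta[2 * w + 1] * math.log(1 + attempts[w])
               for w in range(len(wins)))


# Hypothetical history of one student on one item, timestamps in days.
windows = [1 / 24, 1, 7, 30, math.inf]
past = [(0.0, True), (5.0, False), (9.5, True)]
wins, attempts = window_counts(past, now=10.0, windows=windows)
strength = h_theta(wins, attempts, theta=[1.0, 0.5] * len(windows))
```

Because the windows are nested, an attempt made half a day ago is counted in every window of one day or longer, while an attempt made ten days ago only contributes to the longest windows.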


We can see that Lindsey et al. [18] make use of the additive 
factor models framework for taking memory decay and the 
benefits of past practice into account. Their model outper- 
forms IRT and a baseline on their dataset COLT, with an 
accumulative prediction error metric. To avoid overfitting
and make model training easier, they use a hierarchical
Bayesian regularization.


To the best of our knowledge, no knowledge tracing model 
accounts for both multiple skills tagging and memory decay. 
We intend to bridge this gap by extending DASH. 


3. FRAMING THE PROBLEM 


In our setting, the student learns to master a set of skills by
sequentially interacting with an adaptive spacing system.
At each iteration, this system selects an item (or exercise,
or question) for the student, e.g. What is $\lim_{x \to 0} (\sin x)/x$?
This selection is made by optimizing a utility function $\ell$ that
rewards long-term mastery of the set of KCs to learn. Then,
the student answers the item and the system uses the
correctness of the answer to update its belief concerning the
student memory and learning state on the skills involved by
the item. Finally, the system provides the student with
corrective feedback.


In a nutshell, our present research goal is to maximize
mastery and memory of a fixed set of skills among students
during a given time interval while minimizing the time spent
studying.


We rely on the following assumptions: 


• information to learn and remember consists in a set of
  skills⁴ $k \in [\![1, K]\!]$;


• skill mastery and memorization of student $s$ at time $t$ is
  measured by the ability of $s$ to answer an (unseen) item
  involving that skill, i.e. by their ability to generalize
  to unseen material;


• students first have access to some theoretical knowledge
  about skills, but learning happens with retrieval
  practice;


• items are tagged with one or multiple skills and this
  information is synthesized inside a binary q-matrix [41];


• students forget: skill mastery decreases as time goes
  by since last practice of that skill.


Unlike Lindsey et al. [18], we do not assume that items
involving skill $k$ are interchangeable: their difficulties, for
instance, may differ from one another. Thus, the selection
phase is two-fold in that it requires selecting the skill to
practice and the item to present. In theory, there should be
at least one item for practicing every skill $k$; in practice, one
item would be too few, since the student would probably
“overfit” on the item. This formalization easily encompasses
the flashcards-based adaptive spacing framework: it only
requires associating every item with a distinct skill. This
removes the need to select an item after the skill.


Different utility functions $\ell$ can be considered. For instance,
Reddy, Levine and Dragan consider both the likelihood of 
recalling all items and the expected number of items recalled 
[33]. In our case, the utility function should account for the 
uncertainty of future items to answer. Indeed, if the goal 
of the user is to prepare for an exam, the system must take 
into account that the user will probably have to answer items 
that they did not train with. 


To tackle this problem, like previous work [22, 18], we choose 
to rely on a student learning and forgetting model. In our 
case, this model must be able to quantify mastery and mem- 
ory for any skill or combination of skills. In the next section, 
we present our main contribution: a new student learning 
and forgetting model, coined DAS3H. 


4. OUR MODEL DAS3H 

We now describe our new student learning and forgetting 
model: DAS3H stands for item Difficulty, student Ability, 
Skill, and Student Skill practice History, and builds on the 
DASH model presented in Section 2. Lindsey et al. [18] show 
that DASH outperforms a hierarchical Bayesian version of 
IRT on their experimental data, which consist in student- 
item interactions on a flashcard-based foreign (Spanish) 
language vocabulary reviewing system. They already talk
about knowledge components, but they use this concept to
cluster similar words together (e.g. all conjugations of a
verb). Thus, in their setting, an item has exactly one
knowledge component; different items can belong to the same
knowledge component if they are close enough. As a
consequence, their model formulation does not handle
multiple-skill item tagging, which is common in other
disciplines such as mathematics. Moreover, they assume that
the impact of past practice on the probability of correctness
does not vary across the skills: indeed, DASH has only two
biases per time window $w$, $\theta_{2w+1}$ for past wins and
$\theta_{2w+2}$ for past attempts. It may be a relevant assumption
to prevent overfitting when the number of skills is high, but
at the same time it may degrade performance when the set of
skills is very diverse and inhomogeneous.

⁴ These skills may be organized into a graph of prerequisites,


DAS3H extends DASH to items with multiple skills, and al- 
lows the influence of past practice on present performance 
to differ from one skill to another. One could argue that we 
could aggregate every existing combination of skills into a 
distinct skill to avoid the burden of handling multiple skills. 
However, this solution would not be satisfying since the re- 
sulting model would for instance not be able to capture item 
similarities between two items that share all but one skill 
in common. The use of a representation of multiple skills
makes it possible to account for knowledge transfer from one
item to another. The item-skill relationships are usually
synthesized by a q-matrix and generally require domain
experts’ labor.


We also leverage the recent Knowledge Tracing Machines 
framework [43] to enrich the DASH model by embedding 
the features in $d$ dimensions and modeling pairwise interactions
between those features. So far, KTMs have not been tried 
with memory features. 


In brief, we extend DASH in three ways: 


• Extension to handle multiple skills tagging: new
  temporal module $h_\theta$ that also takes the multiple skills
  into account. The influence of the temporal distribution
  of past practice and of the outcomes of these previous
  attempts may differ from one skill to another;


• Estimation of easiness parameters for each item $j$ and
  skill $k$;


• Use of KTMs [43] instead of mere logistic regression.


For an embedding dimension of d = 0, the quadratic term 
of KTM is cancelled out and our model DAS3H reads: 


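$$P(Y_{s,j,t} = 1) = \sigma\left(\alpha_s - \delta_j + \sum_{k \in KC(j)} \beta_k + h_\theta(t_{s,j,1:l}, y_{s,j,1:l-1})\right).$$

Following Lindsey et al. [18], we choose:


$$h_\theta(t_{s,j,1:l}, y_{s,j,1:l-1}) = \sum_{k \in KC(j)} \sum_{w=0}^{W-1} \theta_{k,2w+1} \log(1 + c_{s,k,w}) - \theta_{k,2w+2} \log(1 + a_{s,k,w}).$$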


Thus, the probability of correctness of student $s$ on item $j$
at time $t$ depends on their ability $\alpha_s$, the difficulty of the
item $\delta_j$ and the sum of the easiness $\beta_k$ of the skills
involved by item $j$. It also depends on the temporal distribution
and the outcomes of past practice, synthesized by $h_\theta$. In
$h_\theta$, $w$ denotes the index of the time window, $c_{s,k,w}$
denotes the number of times that KC $k$ has been correctly
recalled in window $w$ by student $s$ earlier, and $a_{s,k,w}$ the
number of times that KC $k$ has been encountered in time window
$w$ by student $s$ earlier. Intuitively, $h_\theta$ can be seen as a
sum of memory strengths, one for each skill involved in item $j$.
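
The DAS3H prediction for an embedding dimension of zero can be sketched in Python as follows. This is a minimal illustration of the formula above, not the authors' released code; the data layout (per-skill lists of window counts), all parameter values, and the function name `das3h_probability` are hypothetical.

```python
import math


def das3h_probability(alpha_s, delta_j, skills, beta, theta, wins, attempts):
    """DAS3H with d = 0 for one student-item pair.

    wins[k][w] and attempts[k][w] are the counts for skill k in time window w;
    theta[k] is a flat list of 2W weights, with theta[k][2w] applied to wins
    and theta[k][2w + 1] applied to attempts.
    """
    logit = alpha_s - delta_j
    for k in skills:
        logit += beta[k]
        for w in range(len(wins[k])):
            logit += theta[k][2 * w] * math.log(1 + wins[k][w])
            logit -= theta[k][2 * w + 1] * math.log(1 + attempts[k][w])
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical item tagged with skills 0 and 1, using W = 2 time windows.
skills = [0, 1]
beta = {0: 0.3, 1: -0.4}
theta = {0: [1.0, 0.5, 1.0, 0.5], 1: [1.0, 0.5, 1.0, 0.5]}
no_history = {0: [0, 0], 1: [0, 0]}
wins = {0: [1, 2], 1: [0, 1]}
attempts = {0: [1, 3], 1: [1, 2]}

p_cold = das3h_probability(0.2, 0.1, skills, beta, theta, no_history, no_history)
p_hist = das3h_probability(0.2, 0.1, skills, beta, theta, wins, attempts)
```

Without any practice history the time module vanishes and only ability, difficulty, and skill easiness remain; with the made-up history above, the mostly successful practice on skill 0 raises the predicted probability.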


For higher embedding dimensions $d > 0$, in our
implementation we use probit as the link function. All features
are embedded in $d$ dimensions and their interaction is modeled
in a pairwise manner. For a more thorough description of
KTMs, see [43]. To implement a model within the KTM
framework, one must decide which features to encode in the
sparse $x$ vector. In our case, we chose user $s$, item $j$, skills
$k \in KC(j)$, wins $c_{s,k,w}$ and attempts $a_{s,k,w}$ for each time
window $w$.


Compared to DASH, and setting aside the additional
parameters induced by the regularization scheme, DAS3H has
$(d + 1)(K + 2W(K - 1))$ more feature parameters to
estimate. To avoid overfitting, we use additional hierarchical
distributional assumptions for the parameters to estimate,
as described in the next section.
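
The count above can be checked by listing the features that each model encodes in the sparse vector. The derivation below is a sketch under an assumed encoding: DASH uses one feature per student, one per item, and one per window for wins and for attempts, while DAS3H adds one feature per skill and makes the window features skill-specific. The symbols $N_{\text{DASH}}$ and $N_{\text{DAS3H}}$ are introduced here for illustration only.

```latex
\begin{align*}
N_{\text{DASH}}  &= S + J + 2W, \\
N_{\text{DAS3H}} &= S + J + K + 2WK, \\
N_{\text{DAS3H}} - N_{\text{DASH}} &= K + 2WK - 2W = K + 2W(K - 1).
\end{align*}
```

Each feature carries one bias and one $d$-dimensional embedding, which gives the factor $(d + 1)$.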


5. EXPERIMENTS 

To evaluate the performance of our model, we compared 
DAS3H to several state-of-the-art student models on three 
different educational datasets. These models have been de- 
tailed in Section 2. 


5.1 Experimental setting 

We perform 5-fold cross-validation at the student level for 
our experiments. This means that the student population is 
split into 5 disjoint groups and that cross-validation is made 
on this basis. This evaluation method, also used in [43], has 
the advantage of showing how well an educational data mining
model generalizes over previously unseen students. 
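
A student-level split of this kind can be sketched in plain Python as shown below. This is an illustrative helper rather than the splitting code used for the experiments; the function name `student_folds`, the seed, and the toy interaction log are assumptions.

```python
import random


def student_folds(student_ids, n_folds=5, seed=0):
    """Split interaction rows into folds so that no student spans two folds.

    student_ids holds one student ID per interaction row; the function returns
    a list of (train_indices, test_indices) pairs over the rows.
    """
    students = sorted(set(student_ids))
    random.Random(seed).shuffle(students)
    fold_of = {s: i % n_folds for i, s in enumerate(students)}
    folds = []
    for f in range(n_folds):
        test_idx = [i for i, s in enumerate(student_ids) if fold_of[s] == f]
        train_idx = [i for i, s in enumerate(student_ids) if fold_of[s] != f]
        folds.append((train_idx, test_idx))
    return folds


# Hypothetical log: 10 students with 3 interactions each.
student_ids = [s for s in range(10) for _ in range(3)]
folds = student_folds(student_ids)
```

Every interaction of a given student lands on the same side of each split, so the test folds only contain students that the model has never seen during training. scikit-learn's `GroupKFold` provides a comparable group-aware split.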


Following previous work [34, 43], we use hierarchical
distributional assumptions when $d > 0$ to help model training
and avoid overfitting. More precisely, each feature weight
and feature embedding component follows a normal prior
distribution $\mathcal{N}(\mu, 1/\lambda)$ where $\mu$ and $\lambda$ follow
hyperpriors $\mu \sim \mathcal{N}(0, 1)$ and $\lambda \sim \Gamma(1, 1)$.
In their article [18], Lindsey et al. took a similar approach
but they assumed that the $\alpha_s$ and the $\delta_j$ followed
different distributions. Contrary to us, they did not regularize
the parameters $\theta_w$ associated with the practice history of a
student: our situation is different because we have more
parameters to estimate than they do. We use the same time
windows as Lindsey et al. [18]: $\{1/24, 1, 7, 30, +\infty\}$.
Time units are expressed in days.


Our models were implemented in Python. Code for replicating
our results is freely available on GitHub⁵. Like Vie and
Kashima [43], we used pywFM⁶ as a wrapper for libFM⁷ [34] for
models with $d > 0$. We used 300 iterations for the MCMC
Gibbs sampler. When $d = 0$, we used the scikit-learn [28]
implementation of logistic regression with L2 regularization.


⁵ https://github.com/BenoitChoffin/das3h
⁶ https://github.com/jfloff/pywFM
⁷ http://libfm.org/


Proceedings of The 12th International Conference on Educational Data Mining (EDM 2019)


We compared DAS3H to DASH, IRT, PFA, and AFM within 
the KTM framework, for three different embedding dimen- 
sions: 0, 5, and 20. When d > 0, IRT becomes MIRTb, 
a variant of MIRT that considers a user bias. We do not 
compare to DKT, due to the mild controversy over its per- 
formance [49, 50]. For DASH, we choose to consider item- 
specific biases, and not KC-specific biases: in their original 
setting, Lindsey et al. [18] aggregated items into equivalence 
classes and trained DASH on this basis. This is not always
possible for us because items generally have multiple skill
taggings; however, we tested this possibility in Subsection
5.3 but it did not yield better results.


We used three different datasets: ASSISTments 2012-2013 
(assist12) [11], Bridge to Algebra 2006-2007 (bridge06) 
and Algebra I 2005-2006 (algebra05) [39]. The two latter 
datasets stem from the KDD Cup 2010 EDM Challenge. 
The main problem for our experiments was that few publicly
available datasets combine both time variables and
multiple-KC tagging. As a result, only the two KDD
Cup 2010 datasets have items that involve multiple KCs at
the same time. As future work, we plan to test DAS3H
on datasets spanning more diverse knowledge domains and
having more fine-grained skill taggings. In ASSISTments
2012-2013, the problem_id variable was used for the items 
and for the KDD Cup datasets, the item variable came from 
the concatenation of the problem and the step IDs, as rec- 
ommended by the challenge organizers. 


We removed users for whom the number of interactions was 
less than 10. We also removed interactions with NaN skills, 
because we feared it would introduce too much noise. For 
the KDD Cup 2010 datasets, we removed interactions which 
seemed to be duplicates, i.e. for which the (user, item, times- 
tamp) tuple was duplicated. Finally, we sparsely encoded 
the features and computed the q-matrices. We detail the 
dataset characteristics (after preprocessing) in Table 1. The 
mean skill delay refers to the mean time interval (in days) 
between two interactions with the same skill, and the mean 
study period refers to the mean time difference between the 
last and the first interaction for each student. 
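
The filtering steps described in this paragraph can be sketched in plain Python. The row format, the function name `preprocess`, and the order in which the filters are applied are assumptions made for illustration; the released code may proceed differently.

```python
from collections import Counter


def preprocess(rows, min_interactions=10, dedupe=True):
    """Filter an interaction log given as a list of dicts.

    Each row has the keys 'user', 'item', 'timestamp' and 'skills', where
    'skills' is None when the skill tagging is missing.
    """
    # Drop interactions without skill information.
    rows = [r for r in rows if r["skills"] is not None]
    # Drop repeated (user, item, timestamp) tuples, keeping the first one.
    if dedupe:
        seen, unique = set(), []
        for r in rows:
            key = (r["user"], r["item"], r["timestamp"])
            if key not in seen:
                seen.add(key)
                unique.append(r)
        rows = unique
    # Keep only users with at least min_interactions remaining interactions.
    counts = Counter(r["user"] for r in rows)
    return [r for r in rows if counts[r["user"]] >= min_interactions]


# Hypothetical log: user "a" has 10 valid rows, one duplicate, one untagged row;
# user "b" has only 3 rows.
rows = [{"user": "a", "item": i, "timestamp": float(i), "skills": {0}}
        for i in range(10)]
rows.append(dict(rows[0]))
rows.append({"user": "a", "item": 99, "timestamp": 20.0, "skills": None})
rows += [{"user": "b", "item": i, "timestamp": float(i), "skills": {1}}
         for i in range(3)]
clean = preprocess(rows)
```

The `dedupe` flag reflects the fact that duplicate removal was only applied to the KDD Cup 2010 datasets.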


5.2 Results


Detailed results can be found in Tables 2, 3 and 4, where 
mean area under the curve scores (AUC) and mean nega- 
tive log-likelihood (NLL) are reported for each model and 
dataset. Accuracy (ACC) is not reported for lack of space.
We found that ACC was highly correlated with AUC and
NLL; the interested reader can find it on the GitHub
repository containing code for the experiments⁸. Standard
deviations over the 5 folds are also reported. We can see that
our model DAS3H outperforms all other models on every 
dataset. 


5.3 Discussion

Our experimental results show that DAS3H is able to more
accurately model student performance when multiple skill
and temporal information is at hand. We hypothesize that
this performance gain stems from a more complex temporal
modeling of the influence of past practice of skills on current
performance.


⁸ https://github.com/BenoitChoffin/das3h


Figure 1: AUC boost when using time windows
features instead of regular wins and attempts (all
datasets). Higher is better.


The impact of the multidimensional embeddings and the
pairwise interactions seems to be very small, but it remains
unclear and should be further investigated. An embedding
dimension of d = 20 is systematically worse or among the
worst for DAS3H on every dataset, but with a smaller d = 5,
the performance is sometimes better than with d = 0. An
intermediate embedding dimension could be preferable, but
our results confirm those of Vie and Kashima [43]: the role
of the dimension d seems to be limited.


In order to make more sense of our results, we wanted to
know what made DAS3H more predictive than its counterparts.
Our hypothesis was that taking into account the past
temporal distribution of practice, as well as the outcomes
of previous encounters with skills, allowed the model to
capture more complex phenomena than just simple practice,
such as forgetting. To test this hypothesis, we performed
some ablation tests. We empirically evaluated the difference
in terms of AUC on our datasets when time window features
were used instead of regular features for wins and attempts.
For each dataset, we compared the mean AUC score of the
original DAS3H model with a similar model for which the time
window wins and attempts features were replaced with regular
wins and fails counts. Thus, the time module $h_\theta$ was
replaced with
$\sum_{k \in KC(j)} \gamma_k c_{s,k} + \rho_k f_{s,k}$,
like in PFA. Since wins, fails and attempts are collinear,
replacing “wins and attempts” with “wins and fails” makes no
difference. The results are plotted in Figure 1. Means and
standard deviations over 5 folds are reported. We chose an
embedding dimension d = 0 since it was in general the best
in the previous experiments. We observe that using time
window features consistently boosts the AUC of the model.
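
The difference between the two feature sets compared in this
ablation can be sketched as follows. The code is only an
illustration under stated assumptions: the window lengths are
the five values (in days) used in our experiments, the
windows are assumed to be nested (each one counts every past
interaction that is at most that old), and the function names
and the per-skill history layout are hypothetical.

```python
import math

# Window lengths in days: one hour, one day, one week, one month, all past.
TIME_WINDOWS = [1 / 24, 1, 7, 30, math.inf]


def regular_counts(history):
    """PFA-style features: total wins and fails, ignoring time.

    history: list of (timestamp_in_days, correct) pairs for one
    student and one skill, with correct equal to 0 or 1.
    """
    wins = sum(correct for _, correct in history)
    fails = len(history) - wins
    return wins, fails


def window_counts(history, now, windows=TIME_WINDOWS):
    """Time window features: one (wins, attempts) pair per window.

    Assumes nested windows: window w counts strictly past
    interactions that are at most w days old at time `now`.
    """
    features = []
    for w in windows:
        recent = [c for t, c in history if t < now and now - t <= w]
        features.append((sum(recent), len(recent)))
    return features
```

With regular counts, a win obtained ten days ago and a win
obtained ten minutes ago are indistinguishable, whereas the
windowed counts place them in different features.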


We also wanted to know if assuming that skill practice
benefits should differ from one skill to another was a
useful assumption. Thus, we compared our original DAS3H
formulation to a different version, closer to the DASH
formulation, in which all skills share the same parameters
$\theta_{2w+1}$ and $\theta_{2w+2}$ inside a given time
window w. We refer to this version of DAS3H as DAS3H_1p. The
results are given in Table 5. They show that using different
parameters for different skills in $h_\theta$ increases AUC
performance. The AUC gain varies between +0.03 and +0.04.
This suggests that some skills have significantly different
learning and forgetting curves.


                                                  Mean         Skills    Mean skill    Mean study
Dataset    Users   Items    Skills  Interactions  correctness  per item  delay (days)  period
assist12   24,750  52,976   265     2,692,889     0.696        1.000     8.54          98.3
bridge06   1,135   129,263  493     1,817,427     0.832        1.013     0.83          149.5
algebra05  569     173,113  112     607,000       0.755        1.363     3.36          109.9

Table 1: Dataset characteristics


model  dim  AUC ↑          NLL ↓
DAS3H  0    0.826 ± 0.003  0.414 ± 0.011
DAS3H  5    0.818 ± 0.004  0.421 ± 0.011
DAS3H  20   0.817 ± 0.005  0.422 ± 0.007
DASH   5    0.775 ± 0.005  0.458 ± 0.012
DASH   20   0.774 ± 0.005  0.456 ± 0.017
DASH   0    0.773 ± 0.002  0.454 ± 0.006
IRT    0    0.771 ± 0.007  0.456 ± 0.015
MIRTb  20   0.770 ± 0.007  0.460 ± 0.007
MIRTb  5    0.770 ± 0.004  0.459 ± 0.011
PFA    0    0.744 ± 0.004  0.481 ± 0.004
AFM    0    0.707 ± 0.005  0.499 ± 0.006
PFA    20   0.670 ± 0.010  1.008 ± 0.047
PFA    5    0.664 ± 0.010  1.107 ± 0.079
AFM    20   0.644 ± 0.005  0.817 ± 0.076
AFM    5    0.640 ± 0.007  0.941 ± 0.056

Table 2: Performance comparison on the Algebra
2005-2006 (PSLC DataShop) dataset. Metrics are
averaged over 5 folds and standard deviations are
reported. ↑ and ↓ respectively indicate that higher
(lower) is better.


One could also argue that this comparison between DAS3H
and DASH is not entirely accurate. In their papers, Lindsey
et al. cluster similar items together to form disjoint
knowledge components. This clustering cannot be performed
directly on the two KDD Cup datasets since some items have
been tagged with multiple skills. Nevertheless, the
ASSISTments 2012-2013 dataset has only single-KC items. To
evaluate whether considering the temporal distribution and
the outcomes of past practice on the KCs (DASH [KC]) or on
the items (DASH [items]) would be better, we compared these
two DASH formulations on ASSISTments 2012-2013. Detailed
results can be found in Table 6. We see that DASH [items]
and DASH [KC] have comparable performance.


model  dim  AUC ↑          NLL ↓
DAS3H  5    0.744 ± 0.002  0.531 ± 0.001
DAS3H  20   0.740 ± 0.001  0.533 ± 0.003
DAS3H  0    0.739 ± 0.001  0.534 ± 0.002
DASH   0    0.703 ± 0.002  0.557 ± 0.004
DASH   5    0.703 ± 0.001  0.557 ± 0.001
DASH   20   0.703 ± 0.002  0.557 ± 0.002
IRT    0    0.702 ± 0.001  0.558 ± 0.001
MIRTb  20   0.701 ± 0.001  0.558 ± 0.001
MIRTb  5    0.701 ± 0.002  0.558 ± 0.001
PFA    5    0.669 ± 0.002  0.577 ± 0.002
PFA    20   0.668 ± 0.002  0.578 ± 0.003
PFA    0    0.668 ± 0.002  0.579 ± 0.002
AFM    5    0.610 ± 0.001  0.597 ± 0.001
AFM    20   0.609 ± 0.001  0.597 ± 0.003
AFM    0    0.608 ± 0.002  0.598 ± 0.002

Table 3: Performance comparison on the ASSISTments
2012-2013 dataset. Metrics are averaged over 5 folds
and standard deviations are reported. ↑ and ↓
respectively indicate that higher (lower) is better.


Finally, let us illustrate the results of DAS3H by taking two
examples of KCs of Algebra I 2005-2006, one for which the
estimated forgetting curve slope is steep, the other one for
which it is flatter. As a proxy for the forgetting curve
slope, we computed the difference of correctness
probabilities when a “win” (i.e. a correct outcome when
answering an item involving a skill) left a single time
window. This difference was computed for every skill, for
every pair of time


model  dim  AUC ↑          NLL ↓
DAS3H  5    0.791 ± 0.005  0.369 ± 0.005
DAS3H  0    0.790 ± 0.004  0.371 ± 0.004
DAS3H  20   0.776 ± 0.023  0.387 ± 0.027
DASH   0    0.749 ± 0.002  0.393 ± 0.007
DASH   20   0.747 ± 0.003  0.399 ± 0.002
IRT    0    0.747 ± 0.002  0.393 ± 0.007
DASH   5    0.747 ± 0.003  0.399 ± 0.002
MIRTb  5    0.746 ± 0.002  0.398 ± 0.006
MIRTb  20   0.746 ± 0.004  0.399 ± 0.007
PFA    20   0.746 ± 0.003  0.397 ± 0.004
PFA    5    0.744 ± 0.007  0.402 ± 0.007
PFA    0    0.739 ± 0.003  0.406 ± 0.008
AFM    5    0.706 ± 0.002  0.411 ± 0.004
AFM    20   0.706 ± 0.002  0.412 ± 0.004
AFM    0    0.692 ± 0.002  0.423 ± 0.006


Table 4: Performance comparison on the Bridge to
Algebra 2006-2007 (PSLC DataShop) dataset. Metrics
are averaged over 5 folds and standard deviations
are reported. ↑ and ↓ respectively indicate that
higher (lower) is better.


model     d   bridge06       algebra05      assist12
DAS3H     0   0.790 ± 0.004  0.826 ± 0.003  0.739 ± 0.001
DAS3H     5   0.791 ± 0.005  0.818 ± 0.004  0.744 ± 0.002
DAS3H     20  0.776 ± 0.023  0.817 ± 0.005  0.740 ± 0.001
DAS3H_1p  0   0.757 ± 0.003  0.789 ± 0.009  0.701 ± 0.002
DAS3H_1p  5   0.757 ± 0.005  0.787 ± 0.005  0.700 ± 0.001
DAS3H_1p  20  0.757 ± 0.003  0.789 ± 0.006  0.701 (< 1e-3)


Table 5: AUC comparison on all datasets between
DAS3H and DAS3H_1p, a version of DAS3H for
which the influence of past practice does not
differ from one skill to another. Standard deviations
are reported. Higher is better.


DASH   d = 0          d = 5          d = 20
items  0.703 ± 0.002  0.703 ± 0.001  0.703 ± 0.002
KC     0.702 ± 0.001  0.701 ± 0.001  0.701 ± 0.001


Table 6: AUC comparison on ASSISTments 2012-2013
between DASH [items] and DASH [KC]. Standard
deviations are reported. Higher is better.


windows, and for every fold. The differences were then
averaged over the 5 folds and over the different time
windows, yielding for every skill the average decrease in
the probability of correctness when a win leaves a single
time window. One of the skills for which memory decays
slowly concerns shading an area for which a given value is
below a threshold: on average and everything else being
equal, the probability of correctness for an item involving
this skill decreases by 1.15% when a single “win” leaves a
time window. Such a skill is indeed not difficult for a
student to master with a few periodic reviews. On the
contrary, the skill concerning the application of exponents
is more difficult to remember as time goes by: for this KC,
the correctness probability decreases by 2.74% when a win
leaves a time window. This is more than double the previous
amount and is consistent with the description of the KC.
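
The proxy described above can be illustrated numerically.
The sketch below assumes a DASH-like time module in which,
for one skill, time window w adds
theta_win[w] * log(1 + wins[w]) - theta_att[w] * log(1 + attempts[w])
to the logit of the correctness probability. The function
names, the `base` term (which stands for everything in the
logit that does not depend on the windows) and all numerical
values are hypothetical and are not estimates from our
experiments.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def skill_logit(base, theta_win, theta_att, wins, attempts):
    """Logit for one skill under the assumed DASH-like time module."""
    z = base
    for tw, ta, c, a in zip(theta_win, theta_att, wins, attempts):
        z += tw * math.log(1 + c) - ta * math.log(1 + a)
    return z


def drop_when_win_leaves(base, theta_win, theta_att, wins, attempts, w):
    """Decrease in correctness probability when one win leaves window w.

    A win leaving a window removes one win and one attempt from the
    counts of that window; all other counts are left unchanged.
    """
    before = sigmoid(skill_logit(base, theta_win, theta_att, wins, attempts))
    wins_after, attempts_after = list(wins), list(attempts)
    wins_after[w] -= 1
    attempts_after[w] -= 1
    after = sigmoid(
        skill_logit(base, theta_win, theta_att, wins_after, attempts_after)
    )
    return before - after
```

Averaging this quantity over windows and folds for each
skill gives a single number per skill, which is how the two
example KCs above were compared.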


In brief, we saw in this section that DAS3H outperforms 
the other EDM models to which we compared it — includ- 
ing DASH. Using time window features instead of regular 
skill wins and attempts counts and estimating different pa- 
rameters for different skills significantly boosts performance. 
Considering that DAS3H outperforms its ablated counter- 
parts and DASH, these results suggest that including both 
item-skill relationships and forgetting effect improves over 
models that consider one or the other. Using multidimen- 
sional embeddings, however, did not seem to provide richer 
feature representations, contrary to our expectations. 


Besides its performance, DAS3H has the advantage of being
suited to the adaptive skill practice scheduling problem we
described in Section 3. Indeed, it encapsulates an
estimation of the current mastery of any skill and
combination of skills for a student s. It can also be used
to infer the future evolution of this mastery and can thus
be leveraged to adaptively optimize a personalized skill
practice schedule.


6. CONCLUSION AND FUTURE WORK 


In this article, we first formulated a research framework
for addressing the problem of optimizing human long-term
memory of skills. More precisely, the knowledge to be
remembered here is applicative: we intend to maximize the
period during which a human learner will be able to leverage
their retention of a skill they practiced to answer an item
involving this skill. This framework assumes multiple-skill
tagging and is also suited to the more common
flashcard-based adaptive review schedulers.


We take a student modeling approach to start addressing
this issue. As a first step towards an efficient skill
practice scheduler for optimizing human long-term memory, we
thus propose a new student learning and forgetting model
named DAS3H, which extends the DASH model proposed by
Lindsey et al. [18]. Contrary to DASH, DAS3H allows each
item to depend on an arbitrary number of knowledge
components. Moreover, a bias for each skill temporal feature
is estimated, whereas DASH assumed that item practice memory
decayed at the same rate for every item. In addition, DAS3H
is based on the recent Knowledge Tracing Machines (KTM)
model [43] because feature embeddings and pairwise
interactions between variables could provide richer models.
To the best of our knowledge, KTMs have never been used with
memory features so far. Finally, we showed that DAS3H
outperforms several state-of-the-art EDM models on three
real-world educational datasets that include information on
timestamps and KCs. We showed that adding time window
features and assuming different learning and forgetting
curves for different skills significantly boosts AUC
performance.


This work could be extended in different ways. First, the
additive form of our model makes it compensatory. In other
terms, if an item j involves two skills $k_1$ and $k_2$, a
student could compensate for little practice in $k_1$ by
increasing their practice in $k_2$. This is the so-called
“explaining away” issue [47]. Using other non-affine models
[15] could be relevant. Following Lindsey et al. [18], we
used 5 time windows (in days) for DAS3H during our
experiments: {1/24, 1, 7, 30, +∞}. Future work could
investigate the impact of alternative sets of time windows,
for instance with more fine-grained time scales. However,
one should pay attention not to add too many parameters to
estimate.
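
The compensatory behaviour mentioned above can be seen on a
toy example. The sketch below uses a deliberately simplified
additive model (one easiness term and one practice weight
per skill, in the spirit of additive factor models) rather
than DAS3H itself; the function name and all parameter
values are made up for illustration.

```python
import math


def additive_prob(practice, beta, gamma):
    """Correctness probability under a toy additive (affine) model.

    practice: dict skill -> number of past practice opportunities.
    beta, gamma: dicts skill -> easiness and practice weight.
    """
    z = sum(beta[k] + gamma[k] * practice.get(k, 0) for k in beta)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters for an item involving skills k1 and k2.
beta = {"k1": -1.0, "k2": -1.0}
gamma = {"k1": 0.5, "k2": 0.5}

# Same total practice, split evenly or concentrated on k2 only.
balanced = additive_prob({"k1": 2, "k2": 2}, beta, gamma)
lopsided = additive_prob({"k1": 0, "k2": 4}, beta, gamma)
```

Both students receive the same predicted probability even
though the second one never practiced k1, which is the kind
of behaviour a non-affine model could avoid.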

Future work should also compare DAS3H and DASH to
additional student models. For instance, R-PFA [12]
(Recent-Performance Factor Analysis) and PFA-decay [13]
extend and outperform PFA by leveraging a representation of
past practice that puts more weight on more recent
interactions. However, they do not explicitly take the
temporal distribution of past practice into account to
predict future student performance. Other memory models,
such as ACT-R [26] or MCM [25], could also be tested against
DAS3H. Latency, or speed of recall, can serve as a proxy for
retrieval difficulty and memory strength [19]. It would be
interesting to test whether incorporating this information
inside DAS3H would result in better model performance.

In a real-world setting, items generally involve multiple
skills at the same time. In such a situation, how should one
select the next item to recommend to a user so as to
maximize their long-term memory? The main issue here is that
we want to anchor skills in their memory, not specific
items. We could think of a two-step recommendation strategy:
first, selecting the skill k* whose recall probability is
closest to a given threshold (this strategy is consistent
with the cognitive psychology literature, as Lindsey et al.
recall [18]) and second, selecting an item among the pool of
items that involve this skill. However, it could be
impossible to find an item that involves only this skill k*.
Also, premature skill reactivations can have a harmful
impact on long-term memory [7]. Thus, a strategy could be to
compute a score (weighted according to the recall
probability of each individual skill) for each skill
combination in the q-matrix and to choose the combination
for which the score is optimized.
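
One possible reading of this two-step strategy is sketched
below. It is not an evaluated method: the function names are
hypothetical, the recall probabilities are assumed to come
from a student model such as DAS3H, and the score used in
the second step (mean distance to the threshold over the
skills of an item) is only one assumed choice among many,
since the text leaves the score unspecified.

```python
def pick_skill(recall_prob, threshold):
    """Step 1: skill whose recall probability is closest to the threshold.

    recall_prob: dict skill -> predicted recall probability.
    """
    return min(recall_prob, key=lambda k: abs(recall_prob[k] - threshold))


def pick_item(q_matrix, skill, recall_prob, threshold):
    """Step 2: item involving `skill` with the best skill-combination score.

    q_matrix: dict item -> set of skills involved by the item.
    The score (assumed here) is the mean distance to the threshold
    over all skills of the item; lower is better.
    """
    candidates = [j for j, skills in q_matrix.items() if skill in skills]
    if not candidates:
        return None

    def score(j):
        skills = q_matrix[j]
        return sum(abs(recall_prob[k] - threshold) for k in skills) / len(skills)

    return min(candidates, key=score)


# Toy usage with made-up probabilities and a made-up q-matrix.
recall_prob = {"k1": 0.9, "k2": 0.45, "k3": 0.2}
q_matrix = {"j1": {"k1", "k2"}, "j2": {"k2", "k3"}, "j3": {"k1"}}
target = pick_skill(recall_prob, threshold=0.5)
chosen = pick_item(q_matrix, target, recall_prob, threshold=0.5)
```

Scoring whole skill combinations in the second step
penalizes items that would also reactivate a skill whose
recall probability is still far above the threshold.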

Finally, we tested our model on three real-world
educational datasets collected from automatic teaching
systems on mathematical knowledge. To experiment with our
model, we were indeed constrained in our choice of datasets,
since few publicly available datasets provide both
information on the timestamps and the skills of the
interactions. As further work, we intend to test our model
on other datasets, from more diverse origins and concerning
different knowledge domains. Collecting large, fine-grained
and detailed educational datasets concerning diverse
disciplines and making them publicly available would more
generally allow EDM researchers to test richer models.
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